Video-Text Pre-training with Learned Regions for Retrieval
Authors
Abstract
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs via aligning the semantics between visual and textual information. State-of-the-art approaches extract visual features from raw pixels in an end-to-end fashion. However, these methods operate at frame-level directly and thus overlook the spatio-temporal structure of objects in video, which yet has a strong synergy with nouns in textual descriptions. In this work, we propose a simple yet effective module for video representation learning, namely RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs. Given a video, our module (1) first quantizes continuous visual features via clustering patch-features into the same cluster according to content similarity, then (2) generates learnable masks to aggregate fragmentary features into regions with complete semantics, and finally (3) models the spatio-temporal dependencies between different semantic regions. In contrast to using off-the-shelf object detectors, our proposed module does not require explicit supervision and is much more computationally efficient. We pre-train the proposed approach on the public WebVid2M and CC3M datasets. Extensive evaluations on four downstream video-text retrieval benchmarks clearly demonstrate the effectiveness of our RegionLearner.
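The three steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`quantize`, `aggregate_regions`, `region_attention`), the nearest-neighbor codebook lookup, the softmax masks, and the single dot-product attention layer are all assumptions chosen for illustration, and random arrays stand in for parameters that would be learned during pre-training.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def quantize(patches, codebook):
    # Step 1 (assumed form): map each patch feature to its nearest codebook
    # entry, so patches with similar content share one cluster.
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    ids = d2.argmin(axis=1)                                                # (N,)
    return codebook[ids], ids

def aggregate_regions(quantized, mask_proj):
    # Step 2 (assumed form): one soft mask per region, normalized over
    # patches, pools the fragmentary patch features into region features.
    logits = quantized @ mask_proj       # (N, R)
    masks = softmax(logits, axis=0)      # each column sums to 1 over patches
    return masks.T @ quantized, masks    # (R, D), (N, R)

def region_attention(regions):
    # Step 3 (assumed form): plain dot-product self-attention lets every
    # region attend to every other region.
    scores = regions @ regions.T / np.sqrt(regions.shape[1])  # (R, R)
    return softmax(scores, axis=1) @ regions                  # (R, D)

# Hypothetical sizes: T frames, an H x W patch grid, D channels,
# K codebook entries, R regions.
rng = np.random.default_rng(0)
T, H, W, D, K, R = 4, 3, 3, 16, 8, 5
patches = rng.normal(size=(T * H * W, D))   # stand-in for backbone features
codebook = rng.normal(size=(K, D))          # stand-in for a learned codebook
mask_proj = rng.normal(size=(D, R))         # stand-in for learned mask weights

quantized, ids = quantize(patches, codebook)
regions, masks = aggregate_regions(quantized, mask_proj)
out = region_attention(regions)
print(out.shape)
```

Patches from all frames are flattened into one sequence here, so the attention step mixes regions across both space and time. A real model would presumably add positional information and train all three stages jointly with the text encoder.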
Similar resources
Learned Lexicon-Driven Interactive Video Retrieval
In this paper, we combine automatic learning of a large lexicon of semantic concepts with traditional video retrieval methods into a novel approach to narrow the semantic gap. The core of the proposed solution is formed by the automatic detection of an unprecedented lexicon of 101 concepts. From there, we explore the combination of query-by-concept, query-by-example, query-by-keyword, and user in...
Automatic text regions location in video frames
Content-based information retrieval from digital video databases and media archives is a challenging problem and is rapidly gaining widespread research and commercial interest. For a reliable retrieval and intelligent access to video programs, indexing should provide semantic descriptors. One way to include more semantic knowledge into the indexing process is to use the text embedded within ima...
A Pre-viewing Step in Video Retrieval
Video files are very complex objects. For many years, researchers developed models to allow for search-and-retrieval systems specific for these objects. Since the results of a query will be a set of videos or of segments of videos, their size may be prohibitive, and do not allow for pre-validation before downloading. Moreover, many features of the video files for example the multiplicity of the...
Video Information Retrieval: Lessons Learned with the Informedia Digital Video Library
Video contains multiple types of audio and visual information, which are difficult to extract, combine or trade-off in general video information retrieval. This paper provides an evaluation on the effects of different types of information used for video retrieval from a video collection. A number of different sources of information are present in most typical broadcast video collections and can...
Fast Video Retrieval under Sparse Training Data
Feature selection for video retrieval applications is impractical with existing techniques, because of their high time complexity and their failure on the relatively sparse training data that is available given video data size. In this paper we present a novel heuristic method for selecting image features for video, called the Complement Sort-Merge Tree (CSMT). It combines the virtues of a wrap...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i3.25414